Abstract:
To efficiently deploy state-of-the-art deep neural network (DNN) workloads with growing computational intensity and structural complexity, scalable DNN accelerators have been proposed in recent years, featuring multiple tensor engines and distributed on-chip buffers. Such spatial architectures significantly expand the scheduling space in terms of parallelism and data reuse potential, which demands delicate workload orchestration. Previous work on the DNN hardware mapping problem mainly focuses on operator-level loop transformation for a single array, which is insufficient for this new challenge. Resource partitioning methods for multiple engines, such as CNN-partition and inter-layer pipelining, have been studied. However, their intrinsic disadvantages of workload imbalance and pipeline delay still prevent scalable accelerators from realizing their full potential. In this paper, we propose atomic dataflow, a novel graph-level scheduling and mapping approach developed for DNN inference. Instead of partitioning hardware resources into fixed regions and binding each DNN layer to a certain region sequentially, atomic dataflow schedules the DNN computation graph at workload-specific granularity (atoms) to ensure PE-array utilization, supports flexible atom ordering to exploit parallelism, and orchestrates atom-engine mapping to optimize data reuse between spatially connected tensor engines. First, we propose a simulated-annealing-based atomic tensor generation algorithm to minimize load imbalance. Second, we develop a dynamic-programming-based atomic DAG scheduling algorithm to systematically explore the massive ordering space. Finally, to facilitate data locality and reduce expensive off-chip memory accesses, we present mapping and buffering strategies that efficiently utilize distributed on-chip storage. With an automated optimization framework established, experimental results show significant improvements over baseline approaches in terms of performance, hardware utilization, and energy consumption.
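The abstract gives no details of the annealing formulation; the toy sketch below (hypothetical cost function and move, not the paper's algorithm) only shows the general shape of simulated-annealing-based load balancing: atoms with known costs are re-homed across tensor engines while the load imbalance is annealed down.

```python
import math
import random

def anneal_atom_assignment(atom_costs, num_engines,
                           t0=1.0, t_min=1e-3, alpha=0.95, iters=200):
    """Toy simulated annealing: assign atoms (work items with known cost)
    to tensor engines so that load imbalance is minimized.
    Illustrative cost model and move, not the paper's algorithm."""
    assign = [random.randrange(num_engines) for _ in atom_costs]

    def imbalance(a):
        loads = [0.0] * num_engines
        for cost, eng in zip(atom_costs, a):
            loads[eng] += cost
        return max(loads) - min(loads)  # objective: spread work evenly

    cur = imbalance(assign)
    t = t0
    while t > t_min:
        for _ in range(iters):
            i = random.randrange(len(assign))   # move: re-home one atom
            old = assign[i]
            assign[i] = random.randrange(num_engines)
            new = imbalance(assign)
            # Accept improvements always, worsenings with Boltzmann probability.
            if new <= cur or random.random() < math.exp((cur - new) / t):
                cur = new
            else:
                assign[i] = old                 # reject: restore old home
        t *= alpha                              # cool down
    return assign, cur
```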
Abstract:
Monolithic SoCs can be decomposed into disparate chiplets that are integrated with advanced packaging technologies. This concept is promising for reducing the manufacturing cost of large-scale SoCs due to the higher yield rate and reusability of chiplets. The chiplets should be designed in a modular manner, without holistic system knowledge, so that they can be reused in different SoCs. However, this design modularity is a major challenge for the networks-on-chip (NoCs) of chiplets. New deadlocks may occur across both the chiplets and the interposer due to the integration, even if the NoC of each individually designed chiplet is deadlock-free. However, conventional deadlock-freedom approaches are unsuitable for handling such deadlocks because they require holistic knowledge and violate modularity. Although several modular approaches specifically target integration-induced deadlocks, their routing is overly restricted and their injection control incurs additional latency. They also lack flexibility under dynamically changing topologies due to their complex software algorithms and hard-wired components. In this paper, a key insight into chiplet integration-induced deadlocks is gained, inspired by which a deadlock recovery framework (named UPP) is proposed. Specifically, it is verified that an integration-induced deadlock always involves a stalled upward packet moving from the interposer to the connected chiplet via a vertical link. Thus, UPP detects a deadlock by discovering the upward packet and recovers the system from deadlock by transmitting the upward packet to its destination. Hybrid flow control mechanisms are proposed to enable the upward packet to bypass the buffers and be transmitted via the normal router datapath. To guarantee the ejection of the upward packet after transmission, a lightweight protocol is proposed to reserve ejection queue entries of the network interface. Experimental results show that, while adhering to design modularity, UPP provides an average runtime speedup of 3.1%–10.3% with an area overhead of less than 4%.
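UPP's detection hardware is not specified in the abstract. Purely as a functional illustration of the stated principle, the sketch below (hypothetical names and threshold) watches the chiplet-side end of a vertical link for an upward packet that has stalled beyond a timeout and hands it to the recovery datapath.

```python
from dataclasses import dataclass

STALL_THRESHOLD = 64  # hypothetical detection timeout, in cycles

@dataclass
class UpwardPort:
    """Chiplet-side endpoint of a vertical link from the interposer."""
    head_packet: object = None  # upward packet at the head of the link
    stall_cycles: int = 0

    def tick(self, moved: bool):
        """Call once per cycle; 'moved' is True if the head packet advanced.
        Returns a packet flagged for deadlock recovery, or None."""
        if self.head_packet is None or moved:
            self.stall_cycles = 0       # progress observed: no deadlock
            return None
        self.stall_cycles += 1
        if self.stall_cycles >= STALL_THRESHOLD:
            # Candidate integration-induced deadlock: recover by letting
            # the upward packet bypass buffers toward its destination.
            self.stall_cycles = 0
            return self.head_packet
        return None
```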
Abstract:
Although convolutional neural network (CNN) models have greatly advanced many fields, the enormous number of parameters and computations in these models poses significant performance and energy challenges for hardware implementations. Transferred-filter-based methods, very promising techniques that have not yet been explored in the architecture domain, can substantially compress CNN models. However, their straightforward hardware implementation inherently incurs massive redundant computations, causing significant energy and time consumption. In this work, a highly efficient transferred-filter-based engine (TFE) is developed to alleviate this deficiency, compressing and accelerating CNN models. First, the filters of CNN models are flexibly transferred according to specific tasks to reduce the model size. Then, two hardware-friendly mechanisms are proposed in the TFE to remove the duplicate computations caused by transferred filters, further accelerating transferred CNN models. The first mechanism exploits the shared weights hidden in each row of transferred filters and reuses the corresponding identical partial sums, eliminating at least 25% of the repetitive computations in each row. The second mechanism intelligently schedules and accesses the memory system to reuse the repetitive partial sums among different rows of the transferred filters, with at least 25% of computations eliminated. Furthermore, an efficient hardware architecture is proposed in the TFE to fully reap the benefits of the two mechanisms, so that different types of networks are flexibly supported. To achieve high energy efficiency, a sub-array-based filter mapping method (SAFM) is proposed, in which the processing element (PE) sub-array is used as the elementary computational unit to support various filters. Input data can thereby be efficiently broadcast within each PE sub-array, and the per-PE load is substantially reduced, which dramatically lowers area and power consumption. Excluding MobileNet-like networks that adopt depth-wise convolution, most mainstream networks can be compressed and accelerated by the proposed TFE. Two state-of-the-art transferred-filter-based methods, i.e., doubly CNN and symmetry CNN, are implemented with the TFE. Compared with Eyeriss, average speedups of 2.93× and 3.17× are achieved on the convolutional layers of various modern CNNs, and the overall energy efficiency is improved by 12.66× and 13.31× on average. Compared with other state-of-the-art related works, the TFE achieves up to a 4.0× parameter reduction, a 2.72× speedup, and a 10.74× energy efficiency improvement on VGGNet.
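The reuse mechanism itself is not detailed in the abstract. As a toy 1-D illustration of why transferred filters create reusable partial sums, the sketch below assumes, for illustration only, that the transfer is a horizontal flip: each elementary product of the original filter row is cached once, so the transferred row costs no additional multiplications.

```python
def row_conv_with_reuse(x, w):
    """1-D valid sliding dot product of input row x with filter row w and
    with its horizontal flip, sharing every elementary product x[m]*w[n].
    Toy illustration of partial-sum reuse for transferred filters."""
    K, N = len(w), len(x)
    # Compute each elementary product exactly once.
    prod = {(m, n): x[m] * w[n] for m in range(N) for n in range(K)}
    # Original filter: out[i] = sum_j x[i+j] * w[j].
    out = [sum(prod[(i + j, j)] for j in range(K))
           for i in range(N - K + 1)]
    # Flipped filter w[::-1] reuses the cached products: its tap j at
    # window i needs x[i+j] * w[K-1-j], which is already in 'prod'.
    out_flip = [sum(prod[(i + j, K - 1 - j)] for j in range(K))
                for i in range(N - K + 1)]
    return out, out_flip
```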
Abstract:
Near-Memory Processing (NMP) systems that integrate accelerators within DIMM (Dual-Inline Memory Module) buffer chips potentially provide high performance with relatively low design and manufacturing costs. However, an inevitable communication bottleneck arises on the main memory bus among peer DIMMs and the host CPU. This bottleneck is rooted in the bus-based nature and the limited point-to-point communication pattern of the main memory system. The aggregated memory bandwidth of DIMM-based NMP scales with the number of DIMMs; when the number of DIMMs in a channel scales up, the per-DIMM point-to-point communication bandwidth scales down, whereas the computation resources and local memory bandwidth per DIMM stay the same. For many important sparse, data-intensive workloads such as graph applications and sparse tensor algebra, we identify that communication among DIMMs and the host CPU easily dominates the processing in previous DIMM-based NMP systems, severely bottlenecking their performance. To tackle this challenge, we propose that inter-DIMM broadcast should be implemented and utilized in the main memory system of DIMM-based NMP. On the hardware side, the main memory bus naturally scales out with broadcast: the per-DIMM effective broadcast bandwidth remains the same as the number of DIMMs grows. On the software side, many sparse applications can be implemented in a form in which broadcasts dominate their communication. Based on these ideas, we design ABC-DIMM, which Alleviates the Bottleneck of Communication in DIMM-based NMP, consisting of integral broadcast mechanisms and a Broadcast-Process programming framework, with minimal modifications to the commodity software-hardware stack. Our evaluation shows that ABC-DIMM offers an 8.33× geometric-mean speedup over a 16-core CPU baseline and outperforms two NMP baselines by 2.59× and 2.93× on average.
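To make the scaling argument concrete, here is the back-of-the-envelope arithmetic with an assumed channel bandwidth (illustrative numbers, not from the paper): point-to-point transfers time-share the bus, so per-DIMM bandwidth shrinks as 1/N, while one broadcast transaction reaches all N DIMMs at once.

```python
CHANNEL_BW_GBS = 25.6  # assumed DDR4-3200 channel bandwidth, illustrative

def per_dimm_bandwidth(n_dimms):
    """Effective per-DIMM communication bandwidth on one shared channel."""
    p2p = CHANNEL_BW_GBS / n_dimms  # bus is time-shared among pairwise transfers
    broadcast = CHANNEL_BW_GBS      # one bus transaction reaches every DIMM
    return p2p, broadcast

for n in (2, 4, 8, 16):
    p2p, bc = per_dimm_bandwidth(n)
    print(f"{n:2d} DIMMs: point-to-point {p2p:5.2f} GB/s/DIMM, "
          f"broadcast {bc:5.2f} GB/s/DIMM")
```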
Abstract:
Bit-serial computation has been a prevailing convolution method for accelerating varying-precision DNNs: it slices multi-bit data into multiple 1-bit data and transforms a multiplication into multiple additions, where additions of zero bits are ineffectual, while additions of non-zero bits are repetitive, since multiple kernels are likely to possess non-zero bits at the same kernel positions. Previous bit-serial accelerators only remove ineffectual additions by skipping the computation of zero bits; repetitive additions cannot be eliminated because these accelerators compute the convolution of each kernel independently. In this work, we propose the fused kernel convolution algorithm, which eliminates both ineffectual and repetitive additions in bit-serial computation by exploiting bit repetition and bit sparsity in weights, for both convolutional and fully-connected layers. It unifies the convolutions of multiple kernels into the convolution of one fused kernel by first grouping additions into different patterns and then reconstructing the convolution results, minimizing the addition count. Meanwhile, memory accesses of activations and partial sums decrease due to the reduced convolution count. A fused-kernel-convolution-based accelerator, FuseKNA, is then designed with compact compute logic, fully exploiting the value sparsity of activations and the bit sparsity of weights. Benchmarked on a set of mainstream DNNs, FuseKNA improves performance by $4.47 \times$, $2.31 \times$ and $1.81 \times$, and energy efficiency by $4.13 \times$, $3.06 \times$ and $2.53 \times$, over the state-of-the-art Stripes, Pragmatic and Bit-Tactical.
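The grouping and reconstruction details are beyond the abstract. The toy model below (unsigned weights, dot products rather than full convolution, not FuseKNA's actual datapath) only illustrates the two effects named above: zero bits are skipped, and a non-zero bit shared by several kernels at the same (position, bit-plane) is accumulated once into a pattern sum from which each kernel's result is reconstructed.

```python
from collections import defaultdict

def fused_bit_serial_dot(x, kernels, bits=8):
    """Toy fused-kernel bit-serial dot product with unsigned weights.
    Additions sharing the same 'pattern' (set of kernels with a non-zero
    bit at one (position, bit-plane)) are performed once; per-kernel
    results are then reconstructed from the pattern sums."""
    pattern_sum = defaultdict(int)
    fused_adds = 0
    for p in range(len(x)):
        for b in range(bits):
            users = frozenset(k for k, w in enumerate(kernels)
                              if (w[p] >> b) & 1)   # zero bits are skipped
            if users:
                pattern_sum[users] += x[p] << b      # one shared addition
                fused_adds += 1
    # Reconstruction: each kernel sums the patterns it participates in.
    acc = [sum(s for pat, s in pattern_sum.items() if k in pat)
           for k in range(len(kernels))]
    # Baseline: one addition per non-zero weight bit per kernel.
    naive_adds = sum(bin(w).count("1") for ws in kernels for w in ws)
    return acc, naive_adds, fused_adds

# Example: two 3-tap kernels sharing non-zero bits at the same positions.
acc, naive, fused = fused_bit_serial_dot([1, 2, 3], [[3, 5, 0], [3, 1, 4]])
print(acc, naive, fused)   # acc == [13, 17], fused < naive
```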
Abstract:
TCAM (Ternary Content-Addressable Memory) is the essential component for high-speed packet classification in modern hardware switches. However, due to its relatively slow update process, recent advances in Software-Defined Networking (SDN) regard it as the bottleneck to the agile deployment of network services. Rule installation in commodity switches suffers from non-deterministic delays, ranging from a few milliseconds to nearly half a second. The crux of the problem is that TCAM prioritizes rules by physical address: existing entries have to be reallocated according to the priority of an incoming rule, so the insertion delay grows linearly with the number of existing rules in a TCAM. In this paper, we present Constant-time Alteration Ternary CAM (CATCAM), which can accomplish both lookup queries and update requests for packet classification in a few nanoseconds. The key to fast updates is to decouple rule priorities from physical addresses. We propose a matrix-based priority encoding scheme that records the priority relation between rules and can be implemented in 8T SRAM arrays with the emerging Processing In-Memory (PIM) technique. CATCAM also comes with a hierarchical architecture to scale out, and its interval-based scheduling scheme guarantees deterministic update performance in all scenarios. CATCAM is developed as a full-custom design in a 28 nm process. Evaluation across benchmark workloads shows that CATCAM provides at least three orders of magnitude speedup over state-of-the-art TCAM update algorithms and offers search capability equivalent to conventional TCAM while incurring 0.3% power and 20% area overheads.
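The abstract only names the matrix-based priority encoding; the software model below (hypothetical API, not CATCAM's circuit) captures its key property: a boolean matrix records, for every pair of rule slots, which one has higher priority, so inserting a rule writes one row and one column instead of relocating entries, and a multi-match lookup returns the matched slot that dominates all other matched slots.

```python
class PriorityMatrix:
    """Software model of matrix-based priority encoding. higher[i][j] is
    True iff the rule in slot i has priority over the rule in slot j.
    Assumes distinct priorities. In hardware the row/column update is a
    parallel write into the 8T SRAM array (constant time); this Python
    loop is only a functional stand-in."""
    def __init__(self, capacity):
        self.higher = [[False] * capacity for _ in range(capacity)]
        self.free = list(range(capacity))
        self.prio = {}                       # slot -> numeric priority

    def insert(self, priority):
        """Place a rule in any free slot; no entries are relocated."""
        slot = self.free.pop()
        for other, p in self.prio.items():   # one row and one column
            self.higher[slot][other] = priority > p
            self.higher[other][slot] = p > priority
        self.prio[slot] = priority
        return slot

    def winner(self, matched_slots):
        """Return the matched slot that dominates all other matches."""
        for s in matched_slots:
            if all(self.higher[s][t] for t in matched_slots if t != s):
                return s
        return None
```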